
    GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest

    <p>Abstract</p> <p>Background</p> <p>Microarray data are often used for patient classification and gene selection. An appropriate tool for end users and biomedical researchers should combine user friendliness with statistical rigor, including carefully avoiding selection biases and allowing analysis of multiple solutions, together with access to additional functional information for selected genes. Methodologically, such a tool would be of greater use if it incorporated state-of-the-art computational approaches and made its source code available.</p> <p>Results</p> <p>We have developed GeneSrF, a web-based tool, and varSelRF, an R package, that implement, in the context of patient classification, a validated method for selecting very small sets of genes while preserving classification accuracy. Computation is parallelized, taking advantage of multicore CPUs and clusters of workstations. Output includes bootstrapped estimates of the prediction error rate and assessments of the stability of the solutions. Clickable tables link to additional information for each gene (GO terms, PubMed citations, KEGG pathways), and output can be sent to PaLS for examination of PubMed references, GO terms, and KEGG and Reactome pathways characteristic of sets of genes selected for class prediction. The full source code is available, allowing users to extend the software. The web-based application is available from <url>http://genesrf2.bioinfo.cnio.es</url>. All source code is available from Bioinformatics.org or The Launchpad. The R package is also available from CRAN.</p> <p>Conclusion</p> <p>varSelRF and GeneSrF implement a validated method for gene selection, including bootstrap estimates of the classification error rate. They are valuable tools for applied biomedical researchers, especially for exploratory work with microarray data. Because of the underlying technology used (a combination of parallelization and a web-based application), they are also of methodological interest to bioinformaticians and biostatisticians.</p>
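The selection strategy behind varSelRF is iterative backward elimination: fit a random forest, rank genes by variable importance, drop the least important fraction, refit, and finally keep the smallest gene set whose error stays close to the best error observed. A minimal sketch of that loop in Python, with hypothetical `importance_fn` and `error_fn` stand-ins for the random forest's importance and out-of-bag error computations (the toy functions and gene names below are illustrative, not part of the package):

```python
def backward_elimination(features, importance_fn, error_fn, drop_frac=0.2):
    """varSelRF-style backward elimination (sketch): repeatedly drop the
    least important fraction of features, then return the smallest set
    whose error stays within a tolerance of the best error seen."""
    history = []
    current = list(features)
    while len(current) >= 2:
        imp = importance_fn(current)                    # per-feature importance
        history.append((list(current), error_fn(current)))
        ranked = sorted(current, key=lambda f: imp[f])  # least important first
        n_drop = max(1, int(len(current) * drop_frac))
        current = ranked[n_drop:]
    history.append((list(current), error_fn(current)))
    best = min(err for _, err in history)
    # smallest feature set whose error is within a small tolerance of the best
    ok = [(fs, err) for fs, err in history if err <= best + 0.01]
    return min(ok, key=lambda fe: len(fe[0]))[0]

# Toy stand-ins: two informative genes among six; error is low whenever
# both informative genes remain in the candidate set.
imp_table = {"g1": 5.0, "g2": 4.0, "g3": 0.10, "g4": 0.20, "g5": 0.05, "g6": 0.15}
selected = backward_elimination(
    list(imp_table),
    importance_fn=lambda fs: {f: imp_table[f] for f in fs},
    error_fn=lambda fs: 0.05 if {"g1", "g2"} <= set(fs) else 0.40,
)
```

In the real method, importances are recomputed from a freshly fitted forest at each step and the tolerance is expressed in standard errors of the error estimate; the loop structure, however, is the same.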

    Asterias: integrated analysis of expression and aCGH data using an open-source, web-based, parallelized software suite

    Asterias (http://www.asterias.info) is an open-source, web-based suite for the analysis of gene expression and aCGH data. Asterias implements validated statistical methods, and most of the applications use parallel computing, which permits taking advantage of multicore CPUs and computing clusters. Access to, and further analysis of, additional biological information and annotations (PubMed references, Gene Ontology terms, KEGG and Reactome pathways) is available either for individual genes (from clickable links in tables and figures) or for sets of genes. The applications cover everything from array normalization to imputation and preprocessing, differential gene expression analysis, class and survival prediction, and aCGH analysis. The source code is available, allowing for extension and reuse of the software. The links to, and analysis of, additional functional information, the parallelization of computation, and the open-source availability of the code make Asterias a unique suite that can exploit features specific to web-based environments.

    Asterias: A Parallelized Web-based Suite for the Analysis of Expression and aCGH Data

    The analysis of expression and CGH arrays plays a central role in the study of complex diseases, especially cancer, including finding markers for early diagnosis and prognosis, choosing an optimal therapy, or increasing our understanding of cancer development and metastasis. Asterias (http://www.asterias.info) is an integrated collection of freely accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI) and run on a server with 60 CPUs for computation; compared to a desktop or a server-based but non-parallelized application, parallelization provides speed-ups of up to a factor of 50. Most of our applications allow the user to obtain additional information for user-selected genes (chromosomal location, PubMed ids, Gene Ontology terms, etc.) by using clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data (DNMAD); converting between different types of gene/clone and protein identifiers (IDconverter/IDClight); filtering and imputation (preP); finding differentially expressed genes related to patient class and survival data (Pomelo II); searching for models of class prediction (Tnasas); using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity (GeneSrF); searching for molecular signatures and predictive genes with survival data (SignS); and detecting regions of genomic DNA gain or loss (ADaCGH). The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications.

    SignS: a parallelized, open-source, freely available, web-based tool for gene selection and molecular signatures for survival and censored data

    <p>Abstract</p> <p>Background</p> <p>Censored data are increasingly common in many microarray studies that attempt to relate gene expression to patient survival. Several new methods have been proposed in the last two years. Most of these methods, however, are not available to biomedical researchers, leading to many re-implementations from scratch of ad-hoc, and suboptimal, approaches with survival data.</p> <p>Results</p> <p>We have developed SignS (Signatures for Survival data), an open-source, freely available, web-based tool and R package for gene selection, building molecular signatures, and prediction with survival data. SignS implements four methods which, according to existing reviews, perform well and, being of very different natures, offer complementary approaches. We use parallel computing via MPI, leading to large decreases in user waiting time. Cross-validation is used to assess predictive performance and the stability of solutions, the latter an issue of increasing concern given that there are often several solutions with similar predictive performance. Biological interpretation of results is enhanced because genes and signatures in models can be sent to other freely available on-line tools for examination of PubMed references, GO terms, and KEGG and Reactome pathways of selected genes.</p> <p>Conclusion</p> <p>SignS is the first web-based tool for survival analysis of expression data, and one of the very few with biomedical researchers as target users. SignS is also one of the few bioinformatics web-based applications to extensively use parallelization, including fault tolerance and crash recovery. Because of its combination of implemented methods, usage of parallel computing, code availability, and links to additional databases, SignS is a unique tool, and will be of immediate relevance to biomedical researchers, biostatisticians and bioinformaticians.</p>
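The cross-validation used here rests on partitioning samples into folds, with each fold held out in turn while models are selected and fitted on the remainder. A minimal sketch of the fold-splitting step in Python (illustrative only, not SignS's actual code):

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Split sample indices 0..n_samples-1 into k roughly equal,
    disjoint folds; each fold serves once as the held-out test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)      # random assignment to folds
    return [sorted(idx[i::k]) for i in range(k)]

folds = kfold_indices(10, 3)
```

Crucially, for the error estimates to be unbiased, gene selection must be repeated inside each training fold rather than performed once on the full dataset, the selection bias the GeneSrF/varSelRF work above also guards against.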

    Performance of random forest when SNPs are in linkage disequilibrium

    <p>Abstract</p> <p>Background</p> <p>Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: changing the tree-building algorithm by building each tree in an RF only with SNPs in LE, modifying the importance measure (IM), and using haplotypes instead of SNPs to build an RF.</p> <p>Results</p> <p>We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetic models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.</p> <p>Conclusion</p> <p>Our results suggest that by strategically revising the Random Forest method's tree-building or importance measure calculation, power can increase when LD exists between SNPs.
We conclude that the revised Random Forest method performed on SNPs offers the advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome-wide association studies.</p>

    The efficacy of various machine learning models for multi-class classification of RNA-seq expression data

    Late diagnosis and high costs are key factors that negatively impact the care of cancer patients worldwide. Although the availability of biological markers for the diagnosis of cancer type is increasing, the cost and reliability of tests currently present a barrier to their routine use. There is a pressing need for accurate methods that enable early diagnosis and cover a broad range of cancers. The use of machine learning and RNA-seq expression analysis has shown promise in the classification of cancer type. However, research is inconclusive about which types of machine learning models are optimal. The suitability of five algorithms was assessed for the classification of 17 different cancer types. Each algorithm was fine-tuned and trained on the full array of 18,015 genes per sample, for 4,221 samples (75% of the dataset). They were then tested with 1,408 samples (25% of the dataset) for which cancer types were withheld, to determine the accuracy of prediction. The results show that ensemble algorithms achieve 100% accuracy in the classification of 14 out of 17 types of cancer. The clustering and classification models, while faster than the ensembles, performed poorly due to the high level of noise in the dataset. When the features were reduced to a list of 20 genes, the ensemble algorithms maintained an accuracy above 95%, as opposed to the clustering and classification models. (Comment: 12 pages, 4 figures, 3 tables; conference paper, Computing Conference 2019, published at https://link.springer.com/chapter/10.1007/978-3-030-22871-2_6)
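The ensemble algorithms referred to above combine many base classifiers by voting. The toy sketch below (with hypothetical threshold "models" standing in for the paper's actual classifiers) shows the hard-voting step that aggregates per-model predictions into a single class label:

```python
from collections import Counter

def ensemble_predict(models, sample):
    """Hard voting: each base model casts one vote for a class label;
    the most common label wins."""
    votes = [model(sample) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Three toy "models" that threshold two hypothetical expression values.
models = [
    lambda x: "tumour" if x[0] > 0.5 else "normal",
    lambda x: "tumour" if x[1] > 0.5 else "normal",
    lambda x: "tumour" if (x[0] + x[1]) / 2 > 0.5 else "normal",
]
label = ensemble_predict(models, (0.9, 0.2))   # votes: tumour, normal, tumour
```

Random forests and boosted trees follow this pattern with hundreds of decision trees as the base models, which is part of why they tolerate the noise that hurt the single clustering and classification models in this study.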

    Preliminary results from the ECOCADIZ 2020-07 Spanish acoustic survey (01 – 14 August 2020)

    The present working document summarises part of the main results obtained from the Spanish (pelagic ecosystem) acoustic survey conducted by IEO between 1 and 14 August 2020 in the Portuguese and Spanish shelf waters (20-200 m isobaths) off the Gulf of Cadiz (GoC) onboard the R/V Miguel Oliver. All 21 foreseen acoustic transects were sampled. A total of 26 valid fishing hauls were carried out for echo-trace ground-truthing purposes. Four additional night trawls were conducted to collect anchovy hydrated females (DEPM). This working document only provides abundance and biomass estimates for anchovy, sardine and chub mackerel, which are presented without age structure. The distribution of all the mid-sized and small pelagic fish species susceptible to being acoustically assessed is also shown from the mapping of their back-scattering energies. GoC anchovy acoustic estimates in summer 2020 were 5153 million fish and 44 877 tonnes, with the bulk of the population occurring in Spanish waters. The current biomass estimate is the second historical maximum within the time series. The estimates of sardine abundance and biomass in summer 2020 were 1923 million fish and 50 721 t, close to the historical average but lower than the values estimated last year and the most recent maximum, reached in 2018. A total of 32 854 t and 448 million fish were estimated for chub mackerel, similar to the most recent estimates and very close to the time-series average.

    An experimental study of the intrinsic stability of random forest variable importance measures

    BACKGROUND: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention on traditional stability under data perturbations or parameter variations, few studies include influences coming from the intrinsic randomness in generating VIMs, i.e. bagging, randomization and permutation. To address these influences, in this paper we introduce a new concept of intrinsic stability of VIMs, defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations or parameter variations. Two widely used VIMs, i.e., Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability. RESULTS: The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional and small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (sample size) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both negative monotonic correlation and negative linear correlation with the intrinsic stability, while OOB accuracy has monotonic correlations with intrinsic stability.
This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to the parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability. CONCLUSION: First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs not only comes from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users become more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.
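One natural way to quantify the intrinsic stability described above is the mean pairwise rank correlation among importance vectors obtained from repeated runs on identical data and parameters. A self-contained sketch (assuming no tied importance values; this is an illustration of the idea, not the paper's exact protocol):

```python
def spearman(x, y):
    """Spearman rank correlation between two vectors, assuming no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def intrinsic_stability(runs):
    """Mean pairwise Spearman correlation among importance vectors from
    repeated VIM runs on the SAME data with the SAME parameters."""
    cors = [spearman(runs[i], runs[j])
            for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return sum(cors) / len(cors)
```

A value near 1 means repeated forests rank the features almost identically, i.e. the residual disagreement attributable purely to bagging, randomization and permutation is small.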

    Bias in random forest variable importance measures: Illustrations, sources and a solution

    BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other. CONCLUSION: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.
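The permutation-based importance measure discussed throughout these papers scores a predictor by how much accuracy is lost when its values are randomly shuffled, breaking any association with the outcome. A generic sketch of that idea in Python, using a hypothetical one-rule classifier as the fitted model (this illustrates plain permutation importance, not the unbiased conditional-inference implementation the authors propose):

```python
import random

def permutation_importance(predict, X, y, feature, n_repeats=30, seed=0):
    """Mean drop in accuracy when one feature's column is shuffled."""
    rng = random.Random(seed)
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    baseline = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)                      # break feature-outcome link
        shuffled = [row[:feature] + (col[i],) + row[feature + 1:]
                    for i, row in enumerate(X)]
        drops.append(baseline - accuracy(shuffled))
    return sum(drops) / n_repeats

# Hypothetical one-rule classifier that only looks at feature 0;
# feature 1 is pure noise, so shuffling it cannot change any prediction.
predict = lambda r: 1 if r[0] > 0 else 0
X = [(1.2, 0.3), (0.8, 0.9), (1.5, 0.1), (0.9, 0.7),
     (-1.1, 0.4), (-0.7, 0.8), (-1.3, 0.2), (-0.5, 0.6)]
y = [1, 1, 1, 1, 0, 0, 0, 0]
imp_used   = permutation_importance(predict, X, y, feature=0)
imp_unused = permutation_importance(predict, X, y, feature=1)
```

The bias the paper identifies arises one level down, during tree construction: variables with many categories or fine measurement scales offer more candidate split points and are preferred by classical trees, which inflates their importance regardless of the permutation step shown here.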